This code loads data from a CSV file and stores it in a Pandas DataFrame.
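The loading step might look like the following minimal sketch. The report does not show the actual file name, so an in-memory CSV sample stands in for the file on disk:

```python
import io
import pandas as pd

# In the report the data comes from a CSV file on disk, e.g.
# data = pd.read_csv("prices.csv"); the file name here is only illustrative,
# so a small in-memory sample is used instead.
csv_text = """timestamp,AKBNK,YKBNK
2017-01-02 10:00:00,7.80,4.10
2017-01-02 10:15:00,7.85,4.12
2017-01-02 10:30:00,7.90,4.15
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (rows, columns)
```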

4.1. Descriptive Analysis

The dataset contains 50,012 data points (observations), and each data point has 61 features.

This code calculates and displays the percentage of missing values in each column of the data DataFrame.

This code provides insights into the data quality by showing the proportion of missing values in each feature. It is important for data preprocessing and deciding how to handle missing data, whether through imputation or removal.
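The missing-value check can be sketched as below; the small DataFrame is illustrative and stands in for the real dataset:

```python
import numpy as np
import pandas as pd

# Illustrative frame with some missing values (stands in for the real data)
data = pd.DataFrame({"AKBNK": [7.8, np.nan, 8.0, 8.1],
                     "VESTL": [5.0, 5.1, np.nan, np.nan]})

# Fraction of nulls per column, expressed as a percentage
missing_pct = data.isnull().mean() * 100
print(missing_pct)
```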

There are no duplicated rows in the dataset. Each row in the DataFrame is unique.

To calculate summary statistics, the describe() function can be used. This provides measures of central tendency (mean, median as the 50th percentile) and dispersion (standard deviation) for each numerical column in the dataset, along with the count, minimum, maximum, and quartiles.
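For example, on a toy single-column frame:

```python
import pandas as pd

data = pd.DataFrame({"AKBNK": [7.8, 7.9, 8.0, 8.1]})

# count, mean, std, min, quartiles (50% is the median), max per numeric column
stats = data.describe()
print(stats)
```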

Standard deviations measure the variability of closing prices. For instance, "AKBNK" has a relatively low standard deviation (approximately 0.94), suggesting that its prices tend to be less volatile, while "VESTL" has a higher standard deviation (approximately 2.55), indicating greater price fluctuations.

Outliers: The large gap between the maximum and the 75th percentile in some stocks, such as "VESTL," "YUNSA," and "YKBNK," suggests the presence of outliers or extreme values.

The code removes the row with the minimum value in the dataset, which is probably erroneous data, since this minimum appears as an extreme outlier in every column.

The 'timestamp' column is converted to datetime and used as the index of the DataFrame. This is particularly useful when working with time-series data.
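This step can be sketched as follows (the sample rows are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"timestamp": ["2017-01-02 10:00", "2017-01-02 10:15"],
                     "AKBNK": [7.80, 7.85]})

# Parse the strings into datetime64 values, then promote them to the index,
# which enables time-based slicing and resampling
data["timestamp"] = pd.to_datetime(data["timestamp"])
data = data.set_index("timestamp")
print(data.index.dtype)
```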

The provided code snippet performs various analyses and visualizations on numerical columns in the data, including calculating skewness and kurtosis, as well as creating line plots, histograms, kernel density estimation plots, and box plots for each column. This is used to quickly analyze and visualize the numerical columns in dataset, exploring their distribution and statistical properties.
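The statistical part of that analysis (plotting omitted) can be sketched like this, on toy columns:

```python
import pandas as pd

# Toy columns: "A" has one extreme value (right-skewed), "B" is symmetric
data = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 100.0],
                     "B": [10.0, 11.0, 12.0, 13.0, 14.0]})

skews = data.skew()       # sample skewness per numerical column
kurts = data.kurtosis()   # excess kurtosis per numerical column
print(skews)
print(kurts)
```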

The provided code snippet creates a bar plot to visualize the skewness and kurtosis values of the different stocks. It uses the skewness and kurtosis values calculated for each stock and displays them in a bar chart.

Negative skew refers to a longer or fatter tail on the left side of the distribution, while positive skew refers to a longer or fatter tail on the right.

Since GOODY has a very large positive skewness value, its distribution is skewed to the right. By contrast, ISDMR and TTKOM show negative skewness, so their distributions have longer left tails.

Positive kurtosis indicates heavier tails and a more peaked distribution, while negative kurtosis suggests lighter tails and a flatter distribution.

Similarly, GOODY has a very large kurtosis value, meaning its distribution is sharply peaked with heavy tails.

4.2. Moving Window Correlation

This code snippet fills missing values in the data DataFrame using the forward-fill (ffill) method.
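Forward-fill propagates the last observed price into each gap, which is a common choice for price series. A minimal sketch:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"AKBNK": [7.8, np.nan, np.nan, 8.1]})

# Each missing entry takes the most recent observed value before it
data = data.ffill()
print(data["AKBNK"].tolist())  # [7.8, 7.8, 7.8, 8.1]
```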

This is the heatmap of the correlation matrix of the stocks.

To choose pairs of stocks for calculating their correlations over a moving window, we can first look at their correlation matrix. For this purpose, I used the code below, which identifies and extracts pairs of stocks in the dataset that have high correlations. I tried to choose highly correlated stocks from the same sector.
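One way to extract such pairs is sketched below; the three-stock frame and the 0.9 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative closing prices (stands in for the full 61-stock dataset)
data = pd.DataFrame({"AKBNK": [1.0, 2.0, 3.0, 4.0],
                     "YKBNK": [1.1, 2.1, 2.9, 4.2],
                     "GOODY": [4.0, 1.0, 3.0, 2.0]})
corr = data.corr()

# Keep only the upper triangle so each pair appears once, then filter
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]  # threshold is a free choice
print(high_pairs)
```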

This code visualizes how the correlation between two stocks ('AKBNK' and 'YKBNK') changes over time using a moving-window approach. It is particularly useful for analyzing the evolving relationship between two assets in a time-series context. Here the window size is set to 20x26, since there are roughly 20 trading days in each month and 26 observations in each day, so the window covers approximately one month. The moving-window approach means that for each row, the correlation coefficient is computed over the previous 20x26 rows of the columns "stock1" and "stock2".
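A minimal sketch of the moving-window correlation, using two synthetic, strongly related series in place of the real AKBNK/YKBNK prices:

```python
import numpy as np
import pandas as pd

# Two synthetic price series driven by a shared random walk
rng = np.random.default_rng(0)
n = 1200
base = rng.normal(size=n).cumsum()
data = pd.DataFrame({"AKBNK": base + rng.normal(scale=0.1, size=n),
                     "YKBNK": base + rng.normal(scale=0.1, size=n)})

window = 20 * 26  # ~20 trading days per month x 26 observations per day
rolling_corr = data["AKBNK"].rolling(window).corr(data["YKBNK"])

# The first window-1 values are NaN; after that, each point is the
# correlation over the preceding 520 observations (roughly one month)
print(rolling_corr.dropna().iloc[-1])
```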

If the plot shows a strong positive correlation, it indicates that the two variables move in the same direction within the given window. This suggests a positive relationship.

The plot below demonstrates the high positive correlation between AKBNK and YKBNK.

Peaks and troughs in the plot indicate periods of strong and weak correlation, respectively. This can be valuable information for understanding when and how the variables are related. There are three negative troughs in the plot.

Stable Correlation: Look for periods where the correlation remains relatively stable. This suggests a consistent relationship between the two variables.

High Correlation: Identify periods with unusually high positive correlation values. These periods suggest that the two variables move closely together within the window, indicating a strong positive relationship.

This code is used to visualize how the correlation between two stocks ('ARCLK' and 'VESTL') changes over time by using a moving window approach.

The plot below demonstrates the positive correlation between ARCLK and VESTL.

4.3. Principal Component Analysis (PCA)

I dropped this column since its null-value ratio is high.

PCA is a dimensionality reduction technique that can help identify the most important features in data and reduce its dimensionality. It's often used for visualizing data in a lower-dimensional space while retaining most of the data's variance.

scaled_data = scaler.fit_transform(data) scales (standardizes) the original data using the StandardScaler. This step ensures that each feature has a mean of 0 and a standard deviation of 1.

PCA(n_components=2) initializes a PCA object with the number of principal components set to 2. This means that PCA will reduce the dimensionality of the data to retain only the top 2 principal components. At the beginning of this part, I wanted to see how well 2 components would cover the dataset.

print(pca.explained_variance_) prints the explained variance of each principal component.

print(pca.explained_variance_ratio_) prints the explained variance ratio of each principal component, which indicates the proportion of total variance explained by each component.

np.cumsum(pca.explained_variance_ratio_) computes the cumulative explained variance. This array shows how much variance is explained by adding successive principal components.
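The steps above can be sketched end-to-end as follows; random data generated from two latent factors stands in for the stock-price DataFrame:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 200 observations of 10 correlated features driven
# by 2 latent factors plus small noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + rng.normal(scale=0.1, size=(200, 10))

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)  # mean 0, std 1 per feature

pca = PCA(n_components=2)
components = pca.fit_transform(scaled_data)

print(pca.explained_variance_)                    # variance of each PC
print(pca.explained_variance_ratio_)              # share of total variance
print(np.cumsum(pca.explained_variance_ratio_))   # cumulative share
```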

The result [0.51344291 0.17537393] obtained from pca.explained_variance_ratio_ means that when we perform Principal Component Analysis (PCA) on the data, the first principal component (PC1) explains approximately 51.34% of the total variance, and the second principal component (PC2) explains approximately 17.54%. These values represent the proportion of the total variance in the data captured by each principal component. In this case, the first principal component captures the majority of the variance, and the second captures a smaller portion. The sum of these explained variance ratios indicates that the two components collectively account for approximately 68.88% of the total variance in the data.

We can interpret this result as follows: PC1 explains the most significant and common pattern or variation in the data. PC2 captures the second most significant pattern, which is orthogonal (uncorrelated) to PC1.

This code snippet generates a scatter plot to visualize the result of PCA with two principal components.

After the two-component PCA, we saw that two principal components are not sufficient to explain the majority of the total variance in the data. The plot below helps us determine how many principal components are enough to capture a desired amount of variance. In this case, the red line at 90% cumulative explained variance provides a threshold for dimensionality reduction. As can be seen from the graph, at approximately 5 components we reach a level that represents 90% of the total variance.
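Choosing the component count against such a threshold can be sketched as below; again the data is synthetic (5 latent factors behind 12 features), so the resulting count is only illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data with 5 latent factors behind 12 features
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 12))
data = latent @ mixing + rng.normal(scale=0.1, size=(300, 12))

scaled = StandardScaler().fit_transform(data)
pca = PCA().fit(scaled)  # keep all components

cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches the 90% line
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(n_components, cumvar[n_components - 1])
```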

The result [0.51344535 0.17532958 0.10123518 0.05238714 0.03712662] obtained from pca.explained_variance_ratio_ means that when we perform Principal Component Analysis (PCA) on the data, the first principal component (PC1) explains approximately 51.34% of the total variance and the second principal component (PC2) explains approximately 17.53%. The third, fourth, and fifth principal components explain approximately 10.1%, 5.2%, and 3.7% of the total variance, respectively. The sum of these explained variance ratios indicates that the five components collectively account for approximately 87.9% of the total variance in the data.

In the context of stock price data, the first few principal components can represent latent variables capturing underlying patterns in the data. First Principal Component (PC1): PC1 captures the most significant and common trend or pattern shared by the stock prices. It might represent a general market trend affecting all the stocks. Second Principal Component (PC2): PC2 captures the second most significant pattern that is orthogonal to PC1. It represents variation that is independent of the first component.

This code snippet creates a bar plot to visualize the variance explained by each principal component.

4.4. Inference with Google Trends

The provided CSV file contains data from Google Trends related to the search term "Tüpraş." Google Trends is a service offered by Google that allows users to analyze the popularity and search interest of specific keywords or terms over time.

This code snippet creates a plot to visualize the relationship between the stock price of TUPRAS and the Google Trends search volume for the term "Tüpraş" over time. Looking at the plot, we can say that there is a shared pattern between the Tüpraş stock price and the search volume from Google Trends. Moreover, there is a high peak in October 2017 in both plots.

When we search for "Tüpraş" around this specific date, we find many news stories about an explosion at a Tüpraş facility. Interestingly, after that explosion, the stock price increased.


Appendix
